Abstract The long observational record is critical to our understanding of the Earth’s
climate, but most observing systems were not developed with a climate 
objective in mind. As a result, tremendous efforts have gone into assessing and 
reprocessing the data records to improve their usefulness in climate studies. 
The purpose of this paper is to both review recent progress in reprocessing 
and reanalyzing observations, and summarize the challenges that must be 
overcome in order to improve our understanding of climate and variability. 
Reprocessing improves data quality through more scrutiny and improved 
retrieval techniques for individual observing systems, while reanalysis merges 
many disparate observations with models through data assimilation, yet both 
aim to provide a climatology of Earth processes. Many challenges remain, 
such as tracking the improvement of processing algorithms and limited spatial 
coverage. Reanalyses have fostered significant research, yet reliable global 
trends in many physical fields are not yet attainable, despite significant 
advances in data assimilation and numerical modeling. Oceanic reanalyses 
have made significant advances in recent years, but will only be discussed 
here in terms of progress toward integrated Earth system analyses. Climate 
data sets are generally adequate for process studies and large-scale climate 
variability. Communication of the strengths, limitations and uncertainties of 
reprocessed observations and reanalysis data, not only among the community
of developers but also with the extended research community, including the
new generations of researchers and the decision makers, is crucial for further
advancement of the observational data records. It must be emphasized that
careful investigation of the data and processing methods is required to use the
observations appropriately.


Keywords Essential climate variables • Climate data records • Data rescue • Data
provenance • Reanalysis • Uncertainty • Bias correction


1 Reprocessing Observations

A major difficulty in understanding past climate change is that, with very few
exceptions, the systems used to make the observations that climate scientists now rely
on were not designed with their needs in mind. Early measurements were often
made out of simple scientific curiosity or needs other than for understanding climate
or forecasting it; latterly, many systems have been driven by other needs such as
operational weather forecasting, or by accelerating improvements in technology.
This has two major consequences.

The first consequence is that although large numbers of observations are available
in digital archives, many more still exist only as paper records, or on obsolete
electronic media and are therefore not available for analysis. Measurements made
by early satellites, whaling ships, missions of exploration, colonial administrators,
and commercial concerns (to name only a few) are found in archives scattered
around the world. Finding, photographing and digitizing observations from paper
records and locating machines capable of reading old data tapes, punch cards, strip
charts or magnetic tapes are each time-consuming and costly, but they are vital to
improving our understanding of the climate. Furthermore, there is a growing need
for longer, higher quality data bases of synoptic timescale phenomena in order to
address questions and concerns about changing climate and weather extremes,
risks and impacts under both natural climatic variability and anthropogenic climate
change. Such demands are leading to a greater emphasis on the recovery, imaging,
digitization, quality control and archiving of, plus ready access to, daily to sub-daily
historical weather observations. These new data will ultimately improve the quality
of the various reanalyses that rely on them. There is also a sense of urgency as many
observations are recorded on perishable media such as paper and magnetic tapes
which degrade over time. Without intervention, our ability to understand and
reconstruct the past is disintegrating in a disturbingly literal sense.

The second major consequence is that current observation system requirements
for climate monitoring and model validation, such as those specified by GCOS
(http://www.wmo.int/pages/prog/gcos/index.php?name=ClimateMonitoringPrinciples),
which typically emphasize continuity and stability over resolution and timeliness, are
met by few historical observing systems. Changes in instrumentation, reporting
times and station locations introduce non-climatic artifacts in the data, necessitating
consistent reprocessing to recover homogeneous climate records. Nevertheless,
reliable assessments of changes in the global climate have been made, such as the
IPCC’s statement that “warming of the climate system is unequivocal”. This
assessment relies on the many multi-decadal climate series which now exist.

Reprocessing of observations aims to improve the quality of the data through 
better algorithms and to understand and communicate the errors and consequent 
uncertainties in the raw and processed observations. Reanalyses differ from
reprocessed observational data sets in that sophisticated data assimilation techniques
are used in combination with global forecast models to produce global estimates 
of continuous data fields based on multiple observational sources (to be discussed 
in the following section). 


1.1 Data Recovery and Archiving 

A vital first step for the understanding of historical data and hence past climate is to 
digitize and make freely available the vast numbers of measurements, other
observations and related metadata that currently exist only in hard copy archives or on
inaccessible (or obsolete) electronic media. Some estimates suggest that the number 
of undigitized observations prior to the Second World War is larger than the number 
of observations currently represented in the largest digital archives. 

Digitizing large numbers of observations that are printed or hand-written in a
variety of languages is labor intensive: imaging fragile paper records is time
consuming and optical character recognition (OCR) technology is not yet capable of
dealing with handwritten log book or terrestrial register entries, so they must be keyed
by hand. Scientific projects such as CLIWOC (García-Herrera et al. 2005), RECLAIM
(Wilkinson et al. 2011) and the international ACRE initiative (Atmospheric
Circulation Reconstructions over the Earth, Allan et al. 2011) have worked to recover
and make available these observations. More recently they have been supplemented
by citizen science projects such as oldweather.org (http://www.oldweather.org) and
Data.Rescue@Home (http://www.data-rescue-at-home.org/), which have reliably
and rapidly digitized large numbers of meteorological observations online at the
same time as increasing public engagement with science via lively e-communities.
Such projects are not only of climatological interest but can also be of wider historical
interest (Allan et al. 2012).

The international ACRE initiative (Allan et al. 2011) both undertakes data rescue
and facilitates data recovery projects around the world and their integration with
existing data archives. A number of these data archives exist. The International
Comprehensive Ocean Atmosphere Data Set (ICOADS, Woodruff et al. 2010)
holds marine meteorological reports covering a wide range of surface variables.
The World Ocean Database (WOD, Showstack 2009) has large holdings of
oceanographic measurements. The Integrated Surface Database (ISD, Lott et al.
2008) holds high-temporal resolution data for land stations. The International
Surface Pressure Databank (ISPD, Yin et al. 2008) contains measurements of
surface pressure from ICOADS and land stations, supplemented by information
about tropical cyclones from the International Best Track Archive for Climate
Stewardship (IBTrACS, Knapp et al. 2010). The Global Precipitation Climatology
Centre (GPCC) has gathered precipitation observations from many different sources.
The International Surface Temperature Initiative (ISTI, Thorne et al. 2011a, b) is
bringing together temperature measurements from many different sources to
provide a single, freely available databank of temperature measurements combined
with metadata concerning the provenance of the data. Nevertheless, these various
activities are very fragile, and often only exist as a result of ‘grassroots’ actions
by the climate science community (Allan et al. 2011, 2012). These projects and
initiatives urgently need to be embedded in an overarching, sustainable, fully funded
and staffed international infrastructure that oversees data rescue activities, and
complements the various implementation and strategy plans and documents on data
through international coordinating bodies, such as GCOS, GEO, WMO and WCRP.

The consolidation of meteorological, hydrological and oceanographic reports
and observations into large archives facilitates the creation of a range of ‘summary’
data sets which are widely used in climate science and can also act as a focus for
an international community of researchers. However, further consolidation could
bring greater benefits. A land equivalent of the ICOADS, for example, would bring
together many of the elements needed to fully describe the meteorological situation
and potentially reduce the efforts that are currently expended to maintain and grow
a large number of different datasets. In fact, both the terrestrial and marine data
efforts need to be integrated and better linked up under an international framework
that supports their activities in a fully sustainable manner.


1.2 Data Set Creation and Evaluation

The difficulties of converting raw observations into data sets which are of use to climate
researchers are well documented (e.g. Lyman et al. 2010; Thorne et al. 2011a; Kent
et al. 2010; Lawrimore et al. 2011; Hossain and Huffman 2008). Systematic errors
and inhomogeneities in data series caused by changes in instrumentation, time of
observation and in the environment of the sensor are often as large as, or larger than,
the signals we hope to detect. Without reliable traceability back to international
measurement standards, the problem of detecting and accounting for these errors is
not easy. Before the satellite era, observations were often sparsely distributed.
Various methods have been devised to impute the values of climatological variables
at locations and times when no such observations were made. The problems are
further compounded by the necessity of making approximations, using uncertain
inputs (such as climatologies), the use of different data archives and having sometimes
limited statistics with which to estimate important parameters. Three examples
will help to illustrate some of these difficulties and the way that they have been
tackled.

One long running example is seen in the different reprocessings of the data from 
the satellite-based Microwave Sounding Units (MSU) which can be used to derive 
vertical temperature profiles through the free atmosphere (Thorne et al. 2011a). The
earliest processing by Spencer and Christy (1990) suggested a monthly precision
of 0.01 °C in the global average lower troposphere temperatures but the lack of a
trend in the satellite data was not physically consistent with contemporary surface 
temperature estimates. However, when other teams (Prabhakara et al. 2000; 
Vinnikov and Grody 2003; Mears et al. 2003) processed the data they found quite 
different long term behavior. Successive iterations of the datasets have considered 
an increasingly broad range of confounding factors including orbital decay, hot 
target temperature and diurnal drift. Twenty years of analysis and reprocessing 
have undoubtedly improved the overall understanding of the MSU instruments 
(Christy et al. 2003; Mears and Wentz 2009a, b), the quality of the data sets and 
estimates of atmospheric temperature trends, but despite these improvements 
temperature trends from the different products still do not agree. This implies either 
the existence of unknown systematic effects, or significant sensitivity to data 
processing choices. Mears et al. (2011) used a Monte Carlo approach to assess the
uncertainty arising from data processing choices, but this did not fully bridge the 
gap between their analysis and others. 

In the past decade, the view of ocean heat content has changed considerably. 
Early estimates of global ocean heat content (Levitus et al. 2000) showed marked 
decadal variability. Gouretski and Koltermann (2007) identified a time-varying bias
in measurements made by expendable bathythermographs (XBTs). An XBT is a
probe that is launched from the deck of a ship and falls down through the ocean 
trailing behind it a fine wire that relays water temperature measurements to the 
operator. The depths of the measurements are estimated from an equation that 
relates time-since-launch to depth. Gouretski and Koltermann (2007) found that 
there were time-varying differences between the actual and estimated depths. Since 
2007, various groups (Wijffels et al. 2008; Ishii and Kimoto 2009; Levitus et al. 
2009; Gouretski and Reseghetti 2010; Good 2011) have proposed adjustments for
the XBT data based on a number of factors including the make and model of the
XBT, water temperature (which is related to viscosity) as well as a pure thermal bias
of unknown origin. By running the different correction methods on a defined set 
of data, it has been possible to begin to assess the uncertainty arising from the 
different parts of the reprocessing, e.g. bias adjustment and choice of climatology
(Lyman et al. 2010). 
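
As a rough illustration of the depth estimate described above, the sketch below
assumes the quadratic fall-rate form z(t) = a t - b t^2 that is commonly used for
XBTs. The function names, the coefficient values and the 3 % fall-rate error are
illustrative assumptions, not values taken from the studies cited here.

```python
def xbt_depth(t_seconds, a=6.691, b=0.00225):
    """Estimated depth (m) after t seconds, assuming z = a*t - b*t**2.

    The default a (m/s) and b (m/s^2) are illustrative assumptions.
    """
    return a * t_seconds - b * t_seconds ** 2


def depth_error(t_seconds, fall_rate_scale=1.03):
    """Depth error (m) if the probe really falls fall_rate_scale times faster.

    Hypothetical error model: the true depth is the nominal depth scaled
    by a constant factor, so the error grows with depth.
    """
    nominal = xbt_depth(t_seconds)
    actual = fall_rate_scale * nominal
    return actual - nominal


# Nominal depth and depth error at a few times since launch.
for t in (10, 50, 100):
    print(t, round(xbt_depth(t), 1), round(depth_error(t), 1))
```

Under this assumed model, a temperature assigned to the wrong depth would look like
a temperature bias wherever temperature changes with depth, which is one way a
fall-rate error that differs between probe types or over time could appear as a
time-varying bias in heat content.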

The third example illustrates a different aspect of the problem. A number
of sea-surface temperature (SST) data sets extend back to the start of the twentieth
century (and before). Because observations become fewer the further back in time one
goes, statistical methods are used to estimate SSTs in data gaps. However, as before, the
data sets differ. Trends in SSTs in the tropical Pacific show different behavior
depending on the data set used. Some data sets show an El Niño-like pattern, others
a La Niña-like pattern (Deser et al. 2010), indicating that uncertainty in long-term
trends can arise from sources other than systematic instrumental error.

Because of the obvious difficulties with observationally-based data sets, it is
dangerous to consider them as unproblematic data points which one can use to build
and challenge theories and hypotheses regarding the climate. The reality is not
so simple. The data sets are themselves based on assumptions and hypotheses
concerning the means by which the observed quantity is physically related to the
climatological variable of interest. In the first example given above, the MSUs are
sensitive to microwave emissions from oxygen molecules in the atmosphere. To
convert the measured radiances to atmospheric temperature requires knowledge of
atmospheric structure, the physical state of the satellite, quantum mechanics and
orbital geometry.

In the first two examples above, the earliest attempts to create homogeneous data
series underestimated the uncertainties because they did not consider a wide enough
range of systematic effects. The physical understanding of the system under study
was incomplete. Such problems are not unique to the study of climate data; see,
for example, Kirshner (2004) on the difficulties of estimating the Hubble constant.
The uncertainty highlighted by the differences between independently processed
data sets is often referred to as structural uncertainty. It arises from the many different
choices made in the processing chain from raw observations to finished product.
Part of this difference will arise from the different systematic effects considered,
implicitly and explicitly, by the groups, but part will also arise from the different
ways independent groups tackle the same problems. In most cases there are a wide
variety of ways in which a particular problem can be approached and no single
method can be proved definitively to be correct. The uncertainty associated with
small changes in method (for example, using a 99 % significance cutoff as opposed
to 95 % for identifying station breaks) can be assessed using Monte Carlo techniques
(see e.g. Mears et al. 2011; Kennedy et al. 2011; Williams et al. 2012) and is referred
to as parametric uncertainty to differentiate it from the deeper, and often larger,
uncertainties associated with more significant structural changes that can only be
assessed by taking independent approaches.
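
The idea of parametric uncertainty can be sketched with a toy Monte Carlo
experiment. The code below is a minimal illustration, not the method of any of the
cited studies: the synthetic station data, the single candidate break, the
neighbour-difference test and the names ols_slope and homogenize are all
assumptions made for the example. The cutoff is drawn between 1.96 and 2.58,
roughly the two-sided 95 % and 99 % points of a normal distribution.

```python
import math
import random
import statistics


def ols_slope(y):
    """Least-squares slope of y against its index."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(y)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def homogenize(candidate, reference, break_index, z_crit):
    """Remove a step at break_index if it exceeds the z_crit cutoff.

    The step is estimated from the candidate-minus-reference series,
    which is assumed to contain no climate signal.
    """
    diff = [c - r for c, r in zip(candidate, reference)]
    before, after = diff[:break_index], diff[break_index:]
    step = statistics.fmean(after) - statistics.fmean(before)
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * statistics.variance(before)
                  + (n2 - 1) * statistics.variance(after)) / (n1 + n2 - 2)
    z = step / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if abs(z) <= z_crit:
        return list(candidate)
    return [c - step if i >= break_index else c
            for i, c in enumerate(candidate)]


# Synthetic (assumed) data: a shared climate signal plus a small station step.
rng = random.Random(0)
n, true_trend, true_step, break_index = 60, 0.01, 0.08, 30
climate = [true_trend * i + rng.gauss(0, 0.2) for i in range(n)]
reference = [c + rng.gauss(0, 0.1) for c in climate]
candidate = [c + rng.gauss(0, 0.1) + (true_step if i >= break_index else 0.0)
             for i, c in enumerate(climate)]

# Monte Carlo over the processing parameter only; the data stay fixed.
trends = []
for _ in range(200):
    z_crit = rng.uniform(1.96, 2.58)
    trends.append(ols_slope(homogenize(candidate, reference, break_index, z_crit)))
print(min(trends), max(trends))
```

The spread of the resulting trends reflects only the choice of cutoff. It says
nothing about whether a different break-detection approach would give a different
answer, which is the structural part of the uncertainty.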

This slow, iterative evolution is what drives improvements in the understanding of
the data. It also highlights the fact that no reprocessing is likely to be final and
definitive. These considerations show the ongoing importance of making multiple,
independent data sets of the same variable, and many analyses that rely on climate
data sets use multiple data sets to show that their results are not sensitive to
structural uncertainty.

Comparisons between different methods have been used to assess the relative
strengths and weaknesses of different approaches. Side by side comparisons of
existing data sets have been made (Yasunaka and Hanawa 2011) but the use of
carefully designed test data sets can be far more illuminating. Real observations can
be used (e.g. Lyman et al. 2010), but in this case the ‘true’ value is unknown. By
using synthetic data sets, where the truth is known, much more can be learned
(e.g. Venema et al. 2012; Williams et al. 2012). Carefully designed test data sets
have been used in metrology to understand uncertainties associated with
software in the measurement chain. However, the National Physical Laboratory
(NPL) best practice guide on validation of software in measurement systems (NPL
report DEM-ES 014) excludes measurement systems where the physics is still being
researched, which is arguably the case for many climate data sets. The International
Surface Temperature Initiative (ISTI, Thorne et al. 2011b) is developing a
sophisticated process for developing test data sets based on synthetic ‘pseudo-observations’
that have been constructed to contain errors and inhomogeneities thought to be
representative of real world cases. By running the algorithms designed to
homogenize station data on these analogues of the real world as well as on the real data, it
will be possible to directly compare the performance of different methods. Tests like
these have been used to study the effectiveness of paleo-reconstruction techniques
(Mann and Rutherford 2002) and have long formed the basis of Observing System
Simulation Experiments (OSSEs). Ideally, such processes need to be ongoing
for two reasons. Firstly, benchmark tests become less useful over time because there
is a danger that the methods will become tuned to their peculiarities. Secondly,
the benchmarks might not address novel uses of the data or reflect new
understanding of the error structures present in real world data.
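
The benchmarking idea can be sketched on a toy problem. In the example below the
pseudo-observations, the single-step error model and the simple detector are all
hypothetical stand-ins chosen for brevity; they are not the ISTI benchmarks or any
published homogenization algorithm. Because the break position is known in each
synthetic world, the detector can be scored directly against the truth.

```python
import random
import statistics


def make_pseudo_series(rng, n=60, step=0.5, noise_sd=0.2):
    """Synthetic difference series with one step at a known random position."""
    true_break = rng.randrange(10, n - 10)
    series = [rng.gauss(0, noise_sd) + (step if i >= true_break else 0.0)
              for i in range(n)]
    return series, true_break


def detect_break(series, margin=5):
    """Hypothetical detector: the split maximising the before/after mean gap."""
    best_index, best_gap = None, -1.0
    for k in range(margin, len(series) - margin):
        gap = abs(statistics.fmean(series[k:]) - statistics.fmean(series[:k]))
        if gap > best_gap:
            best_index, best_gap = k, gap
    return best_index


def hit_rate(step, n_worlds=200, tolerance=2, seed=1):
    """Fraction of synthetic worlds where the break is found within tolerance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_worlds):
        series, true_break = make_pseudo_series(rng, step=step)
        if abs(detect_break(series) - true_break) <= tolerance:
            hits += 1
    return hits / n_worlds


# Score the same detector on worlds with small, medium and large steps.
for step in (0.1, 0.5, 1.0):
    print(step, hit_rate(step))
```

A second detector could be scored on the same synthetic worlds by reusing the
seed, giving a like-for-like comparison that is not possible with real
observations alone.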

Such methods are less effective for assessing homogenization procedures where 
they are based on empirical studies (Brunet et al. 2011), or on physical reasoning 
(Folland and Parker 1995). However, they could be used to cross-check results 
if statistically-based alternatives can be developed. A more empirical approach to 
the problem of assessing data biases is to run observational experiments (Brunet 
et al. 2011) whereby different sensors, including historical sensors, are compared 
side by side over a period of years. Such comparisons can be used to estimate the 
biases and associated uncertainties that can be used to cross check other methods, 
and in periods with fewer observations they may be the only means of assessing the 
data uncertainties. 

Greater emphasis is now being given to the importance of uncertainty in 
observationally-based data sets, but it is not always clear how a user of the data 
should implement or interpret published uncertainty estimates. The traditional 
approach of providing an error bar on a derived value is often unsatisfactory because 
it provides information only on the magnitude of the uncertainty, but not how 
uncertainties co-vary. For example, in the schematic in Fig. 1, each of the red lines
is consistent with the median and 95 % uncertainty range indicated by the black 
line and blue area. By providing only the black line and ‘error bar’, information 
concerning (in this case) the temporal covariance structure of the errors is lost. 
This has implications when the data are further processed, because the covariance 
is needed to correctly propagate the uncertainties. 
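
A small numerical sketch shows why the covariance matters when uncertainties are
propagated. It assumes, purely for illustration, that every time step has the same
error standard deviation sigma and that errors at steps i and j have covariance
sigma^2 rho^|i-j|; the function name sd_of_mean is hypothetical. The uncertainty
of an n-point average then follows from summing the whole covariance matrix.

```python
import math


def sd_of_mean(sigma, rho, n):
    """Standard deviation of an n-point mean for errors with
    covariance sigma**2 * rho**|i - j| (an assumed error model)."""
    total = sum(sigma ** 2 * rho ** abs(i - j)
                for i in range(n) for j in range(n))
    return math.sqrt(total) / n


# Same 'error bar' at every step, very different uncertainty in the average.
for rho in (0.0, 0.9, 1.0):
    print(rho, sd_of_mean(sigma=0.1, rho=rho, n=30))
```

With rho = 0 the errors average down as sigma divided by the square root of n,
while with rho = 1 (a constant offset) averaging does not reduce the uncertainty at
all. An error bar alone cannot distinguish between these cases.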

Recent approaches have drawn representative samples (roughly equivalent to 
the red lines shown in Fig. 1) from the posterior distributions of statistically 
reconstructed fields (Karspeck et al. 2011; Chappell et al. 2012) or representative 
samples from a particular error model (Mears et al. 2011; Kennedy et al. 2011). 
Each sample, or realization, can then be run through an analysis to generate an 
ensemble of results that show the sensitivity of the analysis to observational 
uncertainty. 
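
The ensemble approach can be sketched as follows. This is a minimal illustration
under stated assumptions rather than the procedure of any cited data set: the
error model is a simple AR(1) process, the 'analysis' is a linear trend fit, and
the names ar1_errors and trend_spread are hypothetical.

```python
import math
import random
import statistics


def ols_slope(y):
    """Least-squares slope of y against its index."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(y)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def ar1_errors(rng, n, sigma, rho):
    """One realization of AR(1) errors with stationary standard deviation sigma."""
    errors = [rng.gauss(0, sigma)]
    innovation_sd = sigma * math.sqrt(1 - rho ** 2)
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0, innovation_sd))
    return errors


def trend_spread(best_estimate, sigma, rho, n_members=500, seed=0):
    """Standard deviation of trends across an ensemble of realizations."""
    rng = random.Random(seed)
    trends = []
    for _ in range(n_members):
        errors = ar1_errors(rng, len(best_estimate), sigma, rho)
        member = [b + e for b, e in zip(best_estimate, errors)]
        trends.append(ols_slope(member))
    return statistics.stdev(trends)


# Same per-step error bar (sigma), different temporal covariance (rho).
best = [0.01 * i for i in range(50)]
print(trend_spread(best, sigma=0.1, rho=0.0))
print(trend_spread(best, sigma=0.1, rho=0.9))
```

Both ensembles share the same per-step uncertainty, yet the trend is less well
constrained when the errors are persistent. Running the analysis on each member
exposes this without the user having to handle the covariance matrix explicitly.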



M.G. Bosilovich et al. 





Fig. 1 Four examples showing that very different behaviors are consistent with the same ‘error
bars’. (Top) uncertainty range indicates that high-frequency variability is missing. (Second from
top) uncertainty range indicates a systematic offset. (Bottom and second from bottom) uncertainty
range indicates red-noise error variance


While these issues have been important for assessing large scale long term
climate change, the challenges become even more formidable when data sets are
used to assess climate change at higher resolution in time and in space. It is the
extremes of weather that most often have the highest societal impacts, and detecting
and attributing changes in the statistics of these events is hampered by sparse data
and poorly characterized uncertainties (see the OSC Community Paper on Extremes
by Alexander et al.). The analysis of extremes demands more careful quality
control, which in turn necessitates greater understanding of the underlying
processes, because unusual events can sometimes resemble data errors and vice
versa. In order to provide the data sets demanded by climate services, the problems
detailed above need to be resolved for a new generation of high resolution data sets:
from the discovery, imaging and digitizing of paper records and metadata, through
the management of appropriate archives, the generation of multiple independent
data sets and their intercomparison, to the wide dissemination and documentation of
the final products.

Addressing the above concerns is vital for the creation of Climate Data Records 
(CDR, http://www.ncdc.noaa.gov/cdr/guidelines.html), defined by the National
Research Council (NRC) as “a time series of measurements of sufficient length, 
consistency, and continuity to determine climate variability and change”. At the 
moment, the concept of a CDR has been associated with satellite processing, but 
a similar approach would be illuminating for in situ measurements of other
geophysical variables. Of particular interest from this point of view is the importance
accorded to transparency of data and methods. Openness and transparency have many
advantages over their opposites. They lay bare the assumptions made in the analysis: 
although methods sections in papers can adequately describe an algorithm, there 
is always the danger of ambiguity, or unstated assumptions. Where computer codes 
are provided, they unambiguously describe the methods used. In addition, the 
discovery and correction of errors in data and analysis are greatly facilitated, as is 
the reuse of methods in later analyses (Barnes 2010). The Climate Code Foundation 
(http://climatecode.org/) has been set up to help improve the visibility, availability 
and quality of code used in climate assessments and has recoded the NASA Goddard
Institute for Space Studies global temperature data set, which has been developed
over a number of years, in a single consistent package. 

Assessing the quality of anything is a difficult task (Pirsig 1974) and CDRs are 
no exception. Indices attempting to measure the quality, or maturity, of CDRs
have been proposed (www1.ncdc.noaa.gov/pub/data/sds/cdrp-mtx-0008-v4.0-maturity-matrix.pdf).
These include considerations of criteria such as scientific
maturity, preservation maturity and metadata completeness as well as highlighting 
the importance of independent cross-checks and the provision of validated uncertainty 
estimates. A concept such as “maturity” is dangerous when applied to a single dataset: 
longevity and quality are not equivalent. As shown above, scientific maturity has 
typically developed by means of making multiple independent data sets. Even when 
considering the understanding of a variable across a range of data sets, difficulties 
arise because systematic errors in the data can go undetected for many years. 
“Immaturity” has only ever been obvious in hindsight. 

Climate research encompasses a large range of studies, from process studies,
overlapping more traditional research, that focus on large space-time scale interactions
and coupling (i.e., feedbacks) to global, long-term monitoring (change detection)
and attribution (change explanation). Planning for the needs of all of these uses
is difficult. Greater transparency and traceability of raw data characteristics,
analysis methods and data product uncertainties are also needed to help users
judge whether a particular product is useful for a particular study. Given the large
range of data products currently available, both raw data and analyses, it is sometimes
difficult for users to identify, locate and obtain what they need unless there is an
organized set of information available. A number of approaches can help users find
the data they need.

First, users need information about the various data sets. Journal papers and
technical reports describing data set construction are often less useful as user guides,
with technical details hidden behind journal paywalls or spread across a series of
publications. Initiatives such as the Climate Data Guide project aim to provide
expert and concise reviews of data and quality (http://climatedataguide.ucar.edu/).
By comparing data sets side by side in a common setting, it should be easier for
users to understand the relative strengths and weaknesses of different data sets.

Second, the users need to be able to find the data. This is easiest to do if there
exists a common method for data discovery. At the basic level of individual
meteorological reports, there exist a large number of archives (as mentioned before).
At a higher level, there is no single repository for gridded and otherwise processed
observational data sets that is analogous to the CMIP archive of model data (Meehl
et al. 2000). Generating such an archive would have the dual effect of giving users
easy access to the data in a standard format while allowing data producers to get
their work more widely recognized. Presenting different data sets side by side will
also serve to highlight the uncertainties in the observations themselves. A problem
common to all data sets is that of accurate citation. Where data sets are regularly
updated, a citation to a journal paper might not be sufficient to allow full
reproducibility. Data archives could allow systematic version control of data sets
through a common mechanism, allowing future users to extract a particular data set
as it was downloaded at any given time. There is a growing concern about archiving
and ready access to all of these data under a viable system that can easily handle
the storage of, and access to, an ever-expanding volume of data. Combining such an
archive with detailed provenance information, as anticipated by ISTI, would allow
users to use data of a kind that is appropriate for their particular analysis. In
gathering together observational data, thought must also be given to archiving and
systematizing metadata and documentation. Such things as quality flags, station
histories, calibration records, reanalysis innovations and feedback records,
observer instructions, and so on, provide valuable information for analysts. Ideally,
archives of metadata should coexist with the archives of data to which they refer.
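
The citation problem for regularly updated data sets can be made concrete with a
small sketch. The record layout below is an assumption made for illustration only
(the class, field and function names are hypothetical and do not describe any
existing archive): a version label, a retrieval date and a content checksum are
stored together so that a later user can tell whether the data in hand are the
data that were cited.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSnapshot:
    """Hypothetical citation record for one downloaded copy of a data set."""
    name: str
    version: str
    retrieved: str  # ISO date on which the copy was downloaded
    sha256: str     # checksum of the downloaded bytes

    def citation(self) -> str:
        # A short, human-readable string that pins the exact copy used.
        return (f"{self.name}, version {self.version}, "
                f"retrieved {self.retrieved}, sha256:{self.sha256[:12]}")


def snapshot(name: str, version: str, retrieved: str,
             payload: bytes) -> DatasetSnapshot:
    """Build a snapshot record from the raw bytes of a downloaded file."""
    digest = hashlib.sha256(payload).hexdigest()
    return DatasetSnapshot(name, version, retrieved, digest)


if __name__ == "__main__":
    # Placeholder name and bytes, not a real data set.
    record = snapshot("example-gridded-product", "1.0", "2012-01-01",
                      b"placeholder file contents")
    print(record.citation())
```

Any change to the file contents changes the checksum, so two analyses that cite
the same record can be assumed to have started from identical data.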

Third, the information and data sets need to be integrated. There is not as yet a
systematic way to gather value that has been added by a community that works with
the data. The Climate Data Guide points to the data, but the data exist in a variety
of formats. Collections of data sets exist, but they are sometimes divorced from the
expert guidance necessary to understand them. A number of initiatives are addressing
these problems. ICOADS does incorporate some information concerning quality control,
or bias identification and adjustment, but the IVAD (ICOADS Value-Added Data,
http://icoads.noaa.gov/ivad/) database plans to add a layer that will give users
access to a range of value-added data. The ISTI (International Surface Temperature
Initiative) plans to create an archive of air temperature data and to go further by
including other variables, as well as full provenance information for each
observation in the archive, allowing users to drill down from fully analyzed
products to the original handwritten note made by the observer. Other projects,
such as the Group for High Resolution Sea Surface Temperature (GHRSST,
www.ghrsst.org; Donlon et al. 2007), have produced alternative models for their own
user communities that give access to greater detail, allowing users to make their
own evaluations of uncertainty.


1.3 Recommendations 

1. Projects and initiatives concerning data digitization and archiving of basic
observations urgently need to be embedded in an overarching, sustainable, fully
funded and staffed international infrastructure that oversees data rescue activities,
and complements the various implementation and strategy plans and documents
on data coming out of GCOS, GEO, WMO, WCRP and the like.

2. Terrestrial and marine data efforts need to be integrated and better linked up 
under an international framework that supports their activities in a fully sustainable 
manner. 

3. An archive of observational data sets analogous to the CMIP archive of model
data should be set up and integrated with user-oriented information such as the
Climate Data Guide.


2 Reanalysis of Observations 

Reanalyses differ from reprocessed observational data sets in that sophisticated data 
assimilation techniques are used in combination with global forecast models to 
produce global estimates of continuous data fields based on multiple observational 
sources. One advantage of this approach is that reanalysis data products are available 
at all points in space and time, and that many ancillary variables, not easily or 
routinely observed, are generated by the forecast model subject to the constraints 
provided by the observations. An important disadvantage of the reanalysis technique, 
however, is that the effect of model biases on the reanalyzed fields depends on the 
strength of the observational constraint, which varies both in space and time. This 
needs to be taken into account when reanalysis data are used for weather and climate 
research (e.g. Kalnay et al. 1996). Nevertheless, recent developments in data 
assimilation techniques, combined with improvements in models and observations 
(e.g. due to reprocessing of satellite data) have led to increasing use of modern 
reanalyses for monitoring of the global climate (Dee and Uppala 2009; Dee et al. 
2011b; Blunden et al. 2011). 
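
The dependence of model-bias effects on the strength of the observational
constraint can be illustrated with a scalar analysis update. The sketch below uses
the textbook optimal-interpolation formula, not the scheme of any particular
reanalysis; the function names and the error variances are hypothetical values
chosen for illustration. With analysis x_a = x_b + K (y - x_b), a background bias
b and an unbiased observation, the bias remaining in the analysis is (1 - K) b.

```python
def analysis_gain(sigma_b2: float, sigma_o2: float) -> float:
    """Scalar gain K = sigma_b^2 / (sigma_b^2 + sigma_o^2)."""
    return sigma_b2 / (sigma_b2 + sigma_o2)


def analysis_bias(model_bias: float, sigma_b2: float, sigma_o2: float) -> float:
    """Bias left in the analysis when the background carries `model_bias`
    and the observation is unbiased: (1 - K) * model_bias."""
    return (1.0 - analysis_gain(sigma_b2, sigma_o2)) * model_bias


if __name__ == "__main__":
    model_bias = 1.0  # hypothetical background bias
    sigma_b2 = 1.0    # hypothetical background error variance
    cases = [("strong observational constraint", 0.1),
             ("weak observational constraint", 10.0)]
    for label, sigma_o2 in cases:
        # Observation error variance controls how much the analysis
        # draws to the observation and away from the biased background.
        print(label, round(analysis_bias(model_bias, sigma_b2, sigma_o2), 3))
```

Under these assumed variances, roughly 9% of the model bias survives where the
observations are accurate and roughly 91% survives where they are poor, which is
why a change in observation coverage over time can appear as a spurious climate
signal.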

With multiple reanalyses now available for weather and climate research,
investigators must consider the strengths and weaknesses of each reanalysis.
Estimates of the basic dynamic fields in modern reanalyses are increasingly similar,
especially in the vicinity of abundant observations (Rienecker et al. 2011). The
physics fields (e.g. precipitation and longwave radiation) are more uncertain due to
shortcomings in the assimilating model and its parameterizations. Understanding the
effect of model errors is important both for users and developers of reanalyses, and
is ultimately needed to further improve the representation of climate signals in
reanalyses. Observations provide the essential information content of reanalysis
products; their quality and availability ultimately determine the accuracy that can
be achieved. The types of observations assimilated span the breadth of remotely
sensed and instrumental in situ observations. Dealing with the complexities and
uncertainties in the observing system, including data selection, quality control and
bias correction, can have a crucial effect on the quality of the resulting reanalysis
data.

Given the importance of reanalysis for weather and climate research and
applications, successive generations of advanced reanalysis products can be
anticipated. In the near future, coupling ocean, land and atmosphere will allow a
more integrated reanalysis of historical observations, but may also increase the
presence of model uncertainty. However, given the complexity of all the components
of the Earth system, realizing the true potential of such advancements will require
coordination, not only among developers of future reanalyses but also with the
research community.


2.1 Current Status

The most used and cited reanalysis is the NCEP/NCAR reanalysis, which includes
data going back to 1948 (Kalnay et al. 1996). The 45-year ECMWF reanalysis
(ERA-40, Uppala et al. 2005), which stops in August 2002, has also been extensively
used in weather and climate studies. Both of these reanalyses span the transition
from a predominantly conventional observing system (broadly referring to in situ
observations and retrieved observations that are assimilated) to the modern period
with abundant satellite observations, marked by the introduction of TOVS radiance
measurements in 1979. Many spurious variations in the climate signal have been
identified in these early-generation reanalyses (Bengtsson et al. 2004; Andersson
et al. 2005; Chen et al. 2008a, b), mainly resulting from inadequate bias corrections
of the satellite data and from the effects of model biases, which are modulated by
changes in the observing system. There now exist several atmospheric reanalyses
covering the post-1979 period that are being continued forward in near-real time.
The Japanese 25-year Reanalysis (JRA-25), released for use in March 2006 (Onogi et
al. 2007), is the first effort by the JMA, and their second, JRA-55, is underway
(Ebita et al. 2011). The National Centers for Environmental Prediction (NCEP) second
reanalysis (NCEP-DOE, Kanamitsu et al. 2002) improved upon the NCEP/NCAR reanalysis
data. More recently, ECMWF has produced the ERA-Interim reanalysis based on a
2006 version of their data assimilation system (Dee et al. 2011a), in preparation for
a new climate reanalysis to be produced starting in 2014. NASA’s Modern Era
Retrospective-analysis for Research and Applications (MERRA) was developed as
a tool to better understand NASA’s remote sensing data in a climate context
(Rienecker et al. 2011). The NCEP Climate Forecast System Reanalysis (CFSR; Saha et
al. 2010) became available in early 2010. It was produced with a data assimilation
system that includes precipitation assimilation over land and a semi-coupled
ocean/land/atmosphere model, and is intended for seasonal prediction initialization.
This is only a brief description of the latest atmospheric reanalyses. The basic
information about the data can be found at
http://reanalyses.org/atmosphere/comparison-table, along with similar information
for the latest oceanic reanalyses.

While the fundamental strength in resolving dynamical processes remains, recent 
reanalyses have improved on many aspects of the earlier-generation systems. Direct 
assimilation of the remotely-sensed satellite radiances, rather than assimilation of 
retrieved state estimates, has become the norm. Variational bias correction of the 
satellite radiances effectively anchors these data to high-quality observations from 
radiosondes and other sources (Dee and Uppala 2009; used in ERA-Interim, 
MERRA, and CFSR as well as the forthcoming JRA-55). The recently completed 
CFSR is the first reanalysis to use a weakly-coupled ocean/atmosphere model, 
and also assimilates precipitation data over land. In addition to the technical and 
scientific improvements of the reanalysis systems, increased computational resources 
allow the use of higher-resolution models that better resolve the observations. 
These advances combined have led to improved representations of many physical
parameters and processes in reanalyses, for example improved skill of the large-scale
global and tropical precipitation (Bosilovich et al. 2008, 2011). In addition, the need
for reanalyses to contribute to climate change studies has prompted significant 
innovations. For example, the Twentieth Century Reanalysis (20CR) project carried
out by NOAA in collaboration with CIRES uses the available global surface
pressure observations and a sea surface temperature record reconstructed back to the
1870s in an ensemble-based global analysis method. The resulting analysis is able
to produce weather patterns with the quality of a modern 3-day numerical forecast
(Compo et al. 2011).
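
The idea of anchoring a satellite bias estimate to unbiased observations can be
sketched with a scalar toy problem. This is not the variational bias correction
of Dee and Uppala (2009), which estimates bias-predictor coefficients inside a
full variational analysis; it is a minimal weighted least-squares analogue in
which a state x and a constant satellite bias beta are estimated together. The
function name, arguments and all numerical values are hypothetical.

```python
import numpy as np


def toy_varbc(x_b, y_anchor, y_sat, sigma_b, sigma_anchor, sigma_sat,
              beta_b=0.0, sigma_beta=100.0):
    """Jointly estimate a scalar state x and a satellite bias beta by
    minimising a quadratic cost function (weighted least squares)."""
    # Each entry is one term of the cost function:
    # (operator acting on [x, beta], observed value, error std. dev.)
    terms = [
        ([1.0, 0.0], x_b, sigma_b),            # background constrains x
        ([1.0, 0.0], y_anchor, sigma_anchor),  # unbiased anchor observation
        ([1.0, 1.0], y_sat, sigma_sat),        # satellite sees x + beta
        ([0.0, 1.0], beta_b, sigma_beta),      # weak prior on the bias
    ]
    # Scale every row by 1/sigma so ordinary least squares applies.
    A = np.array([np.array(h) / s for h, _, s in terms])
    b = np.array([v / s for _, v, s in terms])
    (x, beta), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(x), float(beta)


if __name__ == "__main__":
    # Hypothetical brightness-temperature-like values.
    x, beta = toy_varbc(x_b=280.0, y_anchor=280.0, y_sat=282.0,
                        sigma_b=1.0, sigma_anchor=0.5, sigma_sat=1.0)
    print("state:", x, "estimated satellite bias:", beta)
```

In this toy setting the anchor observation and the background determine x, so the
offset of the satellite value is attributed to beta. Without the anchor and
background terms, x and beta could not be separated, which is the reason
high-quality reference observations matter for bias correction.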

Even with substantial improvements, assessment of the uncertainties in reanalysis 
output, especially in the physical processes needed to study climate variations and 
change, remains a significant concern. For a more complete picture of the climate 
system, as represented by reanalyses, the impact of the observations on the resulting 
data should be captured in the analysis of the physical processes (as in Roads et al. 
2002). Even the most recent reanalyses demonstrate, to varying degrees, shifts in the 
time series that can be related to changes in the observing systems being assimilated 
(Dee et al. 2011a, b; Saha et al. 2010; Bosilovich et al. 2011). These shifts, which 
may be due to changing biases in the observations, systematic errors in the assimi- 
lating model, or both, interfere with the ability to detect reliable climate trends from 
the reanalyses. While there are some post-processing techniques that may address 
these spurious features (Robertson et al. 2011), dealing with biases in models and 
observations remains the most difficult challenge for the reanalysis and data assimi- 
lation community in developing future generations of climate reanalyses. 

The number of global reanalyses has increased greatly in recent years, as computing
improves and various entities need reanalyses to support their specific missions.
Furthermore, across the various Earth system disciplines, uncoupled ocean and land
reanalyses are being performed as regularly as those for the atmosphere (Guo et al.
2007; Xue et al. 2011; an evolving list of reanalyses is maintained at
reanalyses.org). Regional reanalyses attempt to improve upon the local
representation of climate and processes that must be handled more generally in
global systems (Mesinger et al. 2006; Verver and Klein Tank 2012). While this increase
in new reanalyses can cause additional work for the research community in
understanding the various strengths and weaknesses, it does provide an opportunity
to investigate the uncertainties of the reanalysis data more quantitatively. For
example, in studying the global water and energy budgets, Trenberth et al. (2011)
characterized the range of values for each term. In addition, collections of
analyses have been used to derive a super-ensemble mean and variance for the ocean
(Xue et al. 2011), land (Guo et al. 2007) and atmosphere (Bosilovich et al. 2009).
While the ensembles can expose biases in the character of various reanalyses, there
is some evidence that the ensemble itself can also provide reasonable data from
weather to monthly timescales. Despite the difficulties in dealing with a large
amount of data, a researcher will benefit from having multiple data sets available
for study. Just as several coupled model integrations are required for present-day
and future climate projections, multiple reanalyses will better contribute to the
characterization of present-day climate. Reanalyses may well benefit from the common
data standards that facilitate evaluation and analysis of the IPCC climate change
experiments.
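
A multi-reanalysis ensemble mean and variance of the kind mentioned above can be
sketched in a few lines. The example assumes that each product has already been
interpolated to a common grid and time sampling; the member names and the numbers
are synthetic placeholders, not values from any reanalysis.

```python
import numpy as np


def ensemble_stats(fields):
    """Mean and sample variance across ensemble members.

    `fields` maps a member name to an array on a common grid.
    """
    stack = np.stack(list(fields.values()), axis=0)
    return stack.mean(axis=0), stack.var(axis=0, ddof=1)


def member_departures(fields):
    """Departure of each member from the ensemble mean, which helps to
    expose the systematic character of an individual product."""
    mean, _ = ensemble_stats(fields)
    return {name: field - mean for name, field in fields.items()}


if __name__ == "__main__":
    # Synthetic 2 x 2 "precipitation" fields for three hypothetical members.
    fields = {
        "member_a": np.array([[1.0, 2.0], [3.0, 4.0]]),
        "member_b": np.array([[2.0, 2.0], [3.0, 5.0]]),
        "member_c": np.array([[3.0, 2.0], [3.0, 6.0]]),
    }
    mean, var = ensemble_stats(fields)
    print(mean)
    print(var)
```

Grid points where the members agree have zero variance, while points with a large
spread flag quantities for which the choice of reanalysis matters.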


2.2 Integrating Earth System Analyses

Observations are the critical resource for a reanalysis, which needs as many as
possible to characterize the state of the Earth system. As decadal predictions begin
to play a role in understanding near-term climate variations, the coupled Earth
system (ocean/land/atmosphere) needs to be initialized in a balanced state. Newer
measurements, such as aerosols, sea ice and ocean salinity, contribute to the need
for reanalyses that encompass the broad Earth system. Therefore, Integrated Earth
Systems Analysis (IESA), which encompasses the connections among these disparate
observations, has become an important challenge for data assimilation development.

NCEP CFSR provides a reanalysis produced with a semi-coupled ocean/land/atmosphere
model, along with an analysis of land precipitation gauge measurements (Saha et al.
2010). Development of the next reanalysis from NASA includes aerosols, ocean
(temperature and salinity), land (soil water) and ocean color (biology) analysis.
While there are significant difficulties in both the modeling and assimilation of
the integrated Earth system, extending these more complex reanalyses to historic
periods, when little or none of the diversity in observations is available, will
require even more effort on addressing the impact of changes in the observing
systems. Likewise, maintaining and expanding many of the Earth observations forward
in time is also a critical issue (Trenberth et al. OSC position paper on observing
system), and reference networks can provide stable benchmarks for reanalyses and
their data assimilation. Consistency and overlap of newer systems will help maintain
consistency in the integrated reanalyses.

2.3 Reanalysis Input Observations 

Essentially, reanalyses without input observations revert to model products, hence 
the importance of the observing system emphasized here. As discussed previously, 
there are numerous value added advantages from reanalysis, but they cannot replace 
observed data. It is very important, especially for new reanalysis users, to understand 
that reanalyses are not observations, but rather, an observation-based data product. 
Since reanalyses combine many types of observations, their relative comparison 
should be valuable in assessing the quality of the observation as well. However, it is 
not always easy to determine which observations are included in the reanalysis at 
specific spatio-temporal coordinates. Any given observation will be weighted with 
other nearby observations and the model forecast in the assimilation process. It may 
be accepted or rejected, and if accepted will contribute to the overall analysis including 
other accepted observations. The degree to which an observation influences an 
analysis can be determined from the output background model forecast error and 
the analysis error (as discussed in Rienecker et al. 2011).
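
One simple way to read such output is sketched below. It assumes that the
"forecast error" and "analysis error" are the observation-minus-forecast (O-F)
and observation-minus-analysis (O-A) departures; that interpretation, the
function names and the numbers are assumptions made for illustration, and the
ratio shown is a rough diagnostic rather than the formal influence measure used
by any reanalysis center.

```python
import math


def departures(obs: float, background: float, analysis: float):
    """Return (O-F, O-A): observation minus background forecast and
    observation minus analysis, all expressed in observation space."""
    return obs - background, obs - analysis


def fraction_drawn(obs: float, background: float, analysis: float) -> float:
    """Fraction of the innovation (O-F) removed by the analysis.

    0 means the analysis stayed at the background at this location;
    1 means the analysis fits the observation exactly. The value is
    undefined (NaN) when the background already matches the observation.
    """
    omf, oma = departures(obs, background, analysis)
    if omf == 0.0:
        return math.nan
    return 1.0 - oma / omf


if __name__ == "__main__":
    # Hypothetical values: observation 10, background 8, analysis 9.5.
    print(departures(10.0, 8.0, 9.5))
    print(fraction_drawn(10.0, 8.0, 9.5))
```

A value near zero for an accepted observation suggests that it carried little
weight relative to the background and neighbouring data, whereas a value near
one indicates a strong local constraint.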

Such output data have been available from reanalysis and data assimilation
products for some time, but have generally been used only by developers or those
closely familiar with the data assimilation methodology. However, these assimilated
observations represent a key component in the output of the reanalyses, and can show
which observations are used and how. For example, Haimberger (2007) used feedback
information from ERA-40 to better characterize inhomogeneities in the radiosonde
time series, and this information was, in turn, used to improve the input
observations to both ERA-Interim and MERRA. To facilitate broader access, assimilated
observations need to be provided in a format easily accessible to the reanalysis
users, so that users can more appropriately identify the agreement between observed
features (including all sources of a given state variable) and reanalysis features at
any specific point in space and time. Even just the capability of easily determining
the presence (or lack thereof) of assimilated observations during a given event
would be useful in many research studies. Typically, the data are produced in
“observation space”, that is, as ASCII records that include space and time
coordinates. To facilitate comparisons with the gridded reanalysis output, the GMAO
has processed MERRA’s assimilated observations to its native grid (Rienecker et al.
2011), producing the MERRA Gridded Innovations and Observations (GIO). It includes
each observation, its forecast error and analysis error (as well as the count of
observations and variance within the grid box). Similarly, recent efforts at ECMWF
aim to make assimilated observations and the “feedback” files available through a
web interface. With these data, researchers can quickly identify the observations
assimilated at each of the reanalysis grid points.
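
The step from observation-space records to gridded counts, means and variances
can be sketched as a simple binning operation. This is an illustration of the
general idea only, not the GMAO processing; the function name, the regular
1-degree grid and the sample values are hypothetical.

```python
import numpy as np


def grid_observations(lats, lons, values, dlat=1.0, dlon=1.0):
    """Bin point observations onto a regular latitude/longitude grid.

    Returns per-box count, mean and (population) variance. Boxes with no
    observations contain NaN in the mean and variance arrays.
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float) % 360.0  # map to [0, 360)
    values = np.asarray(values, dtype=float)

    nlat = int(round(180.0 / dlat))
    nlon = int(round(360.0 / dlon))
    # Index of the grid box containing each observation.
    i = np.clip(((lats + 90.0) / dlat).astype(int), 0, nlat - 1)
    j = np.clip((lons / dlon).astype(int), 0, nlon - 1)

    count = np.zeros((nlat, nlon))
    total = np.zeros((nlat, nlon))
    total_sq = np.zeros((nlat, nlon))
    # np.add.at accumulates correctly when several observations share a box.
    np.add.at(count, (i, j), 1.0)
    np.add.at(total, (i, j), values)
    np.add.at(total_sq, (i, j), values ** 2)

    with np.errstate(invalid="ignore", divide="ignore"):
        mean = total / count
        var = total_sq / count - mean ** 2
    return count, mean, var


if __name__ == "__main__":
    count, mean, var = grid_observations(
        lats=[0.5, 0.5, 10.2], lons=[0.5, 0.7, 359.5], values=[1.0, 3.0, 5.0])
    print(int(count.sum()), mean[90, 0], var[90, 0])
```

The same binning could be applied to O-F and O-A departures instead of raw
values, and the count array alone answers the question of whether any
observations were present in a grid box during a given event.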

Of course, reanalyses rely on the broad and open availability of increasing numbers
of observing systems and variables. Regarding in situ (sometimes referred to as
conventional) observing networks, reanalysis projects have been able to coordinate
and update data holdings to reflect the latest quality assessments and reprocessing
of the data. For the remote sensing data, however, there remains much less
organization of the data and how they are used in reanalyses. As part of preparations
for a new comprehensive climate reanalysis, an inventory of satellite radiances
potentially available for reanalysis is currently being compiled at ECMWF. Some
remotely sensed data are still assimilated as retrieved state fields, instead of
radiances, and are therefore a function of the algorithm or radiative transfer model
and its version, as well as the version of the input radiance.

There is significant work progressing on the radiances themselves that should
affect their use in reanalyses. For example, intercalibrated MSU data (channels 2-4;
Zou et al. 2006) were newly available and assimilated from the start of MERRA
production, but this was not an option for reanalyses that began before it. The
satellite data input is generally handled by the reanalysis center, which must
maintain contacts with the data community to be informed of all the latest
information and updates. Presently, each center documents its own data usage, but
there is no central information about this for research users to access and
intercompare among reanalyses. As discussed earlier, observations are the key
resource for reanalysis, and reanalyses are sensitive to the assimilated
observations, so it is vitally important for reanalysis projects to have the latest
information and reprocessing of each input data type, and also to convey that
information to the research community. The series of international reanalysis
conferences has provided a focal point for discussions on the accomplishments,
challenges and future directions of reanalyses (e.g. jra.kishou.go.jp/3rac_en.html
and icr4.org). Additionally, a grass-roots effort to open communication among
reanalysis developers and the research community, leveraging internet communication
technology, has begun and is gaining momentum (reanalyses.org).

2.4 Recommendations

1. The research community and reanalysis developers benefit from the availability
of multiple international reanalysis products. Researchers should be encouraged
to use as many as possible to better define the uncertainty of reanalyses. Data
management practices and utilities should be developed to facilitate
intercomparison among reanalyses.

2. Given the criticality of observations and their quality in reanalyses, efficient
and open communications among the reanalysis developers and observation
developers/stewards need to be enhanced. Likewise, information on how the
observations are used in the reanalysis can be used by the observation developers
and research community. Reanalysis developers should be encouraged to provide the
assimilated observations and innovations alongside the characteristic reanalysis
data.

3. Interdisciplinary coupled modeling and assimilation across the atmosphere
(including aerosols and the stratosphere), ocean, land and cryosphere need
significant advancement and communications to accomplish the long-term goals
of integrated reanalyses.



Author's Proof 

On the Reprocessing and Reanalysis of Observations for Climate 

3 Future Directions 

Global data products and their further refinement will continue to be a critical 
resource for understanding the Earth’s climate, variability and change. Not only is
reduction of uncertainty for any individual product important, through improved
algorithms and processing, but global data must also be physically integrated and
consistent in their use of ancillary information and in their assumptions.
These considerations are leading to more formal assessments of global data 
products, such as those put forward by the GEWEX Data and Assessment Panel 
(e.g. Gruber and Levizzani 2008). 

A substantial number of observations are not regularly analyzed in present-day
research projects because they have yet to be digitized. Projects and initiatives
concerning data digitization and archiving of basic observations urgently need to be
embedded in an overarching, sustainable, fully funded and staffed international
infrastructure that oversees data rescue activities, and complements the various
implementation and strategy plans and documents on data coming out of international
coordinating agencies. Terrestrial and marine data efforts need to be integrated and
better linked up under an international framework that supports their activities. An
archive of observational data sets analogous to the CMIP archive of model data should
be established and integrated with user-oriented information such as the Climate
Data Guide.

The reanalysis developer and user community has increased substantially over 
the last decade, mostly due to the broad utility of the data. This paper has addressed 
some of the most pressing challenges facing the international reprocessing and 
reanalysis communities. WCRP has been an integral partner in the development of 
reprocessing and reanalyses, fostering communications within the community 
through workshops, conferences and its scientific panels. Recently, reanalysis data
have been discussed and considered in the derivation of Essential Climate Variables
(ECVs), as well as for climate monitoring and information services (Dee
et al. 2011b). Assessment of global data products is also a major issue for ECVs.

As can be easily seen in the overview summary of reanalyses, the reanalysis
systems are evolving and growing. There will be newer, more advanced and
comprehensive reanalysis data products available in coming years. Regarding
the most recent reanalysis data products, there are many questions on their relative
performance for the many uses and regions covered. It is not feasible for any one
institution to fully assess the quality of all the reanalyses, simply because there
are too many applications of reanalyses. While this does put the burden of
intercomparison on the individual researcher in quite a few instances, communication
and sharing of knowledge between users and developers have become critically
important. In a grass-roots effort to address the communications issues, an effort
to utilize the internet and live documents has begun, to provide a forum that
facilitates communication within the reanalysis community. It is considered a pilot
project, and is called reanalyses.org. At this site, developers can contribute to a
central knowledge base regarding all issues of reanalyses. In addition,
reanalyses.org provides a function to allow users to compare reanalyses. In the long
run, users are encouraged to summarize their results with pointers to detailed
information and ultimately publications on the ongoing efforts. While this should
not be the sole effort to facilitate communications, it does provide an outlet and
focal point for anyone in the community. The Climate Data Guide
(climatedataguide.ucar.edu) provides concentrated information and expert analysis of
many reprocessed data sets, data sources for reanalysis and the reanalyses
themselves. Another platform, the Earth System Grid (ESG), is under development and
will allow users to easily compare the existing reanalyses with observations and also
CMIP present-day simulations. While significant challenges remain, the active
communities of users and developers have numerous avenues of information and
interaction to pursue the solutions.